library(tidyverse)
library(latex2exp)
library(gridExtra)
library(wesanderson)
library(broom)
theme_set(theme_minimal())
Distinguish between (1) residual and semistudentized residual, (2) \(\text{E}[\varepsilon_i] = 0\) and \(\overline{e} = 0\) and (3) error term and residual.
Answer: The semistudentized residual is the residual divided by \(\sqrt{\text{MSE}}\), an approximation of the standard deviation of the residual itself. \(\overline{e} = 0\) states that the mean of the residuals calculated from the fitted simple linear regression model is zero, an algebraic consequence of least squares, while \(\text{E}[\varepsilon_i] = 0\) states that the true errors have an expected value of \(0\), a model assumption. The difference between the error term and the residual is that the residual is the observed difference between the observed value and the fitted value, while the error term is the unobservable true error in the regression model.
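The distinction can be illustrated with a short sketch on simulated data (an illustration, not part of the problem):

```r
# Illustration: residuals sum to zero by construction of least squares,
# and semistudentized residuals are residuals scaled by sqrt(MSE).
set.seed(1)
x = 1:20
y = 2 + 3*x + rnorm(20)
fit = lm(y ~ x)
e = resid(fit)                     # residuals: observed minus fitted
mse = sum(e^2) / (length(e) - 2)   # mean squared error
e_star = e / sqrt(mse)             # semistudentized residuals
```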
Prepare a prototype residual plot for each of the following cases: (1) error variance decreases with \(X\); (2) true regression function is \(\bigcup\) shaped, but a linear regression function is fitted.
Answer:
lm.fit_manual = function(X, Y){
b1 = sum((X - mean(X))*(Y - mean(Y))) / (sum((X - mean(X))^2))
b0 = mean(Y) - b1*mean(X)
return(c(b0, b1))
}
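As a sanity check (on simulated data, not one of the exercise data sets), the closed-form estimates above should agree with R's built-in lm():

```r
# Hypothetical check: the closed-form least squares estimates used by
# lm.fit_manual match coef(lm(...)) on arbitrary data.
set.seed(42)
X = runif(30, 0, 10)
Y = 1 + 2*X + rnorm(30)
b1 = sum((X - mean(X))*(Y - mean(Y))) / sum((X - mean(X))^2)
b0 = mean(Y) - b1*mean(X)
fit = lm(Y ~ X)
# unname(coef(fit)) equals c(b0, b1) up to floating-point error
```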
In this scenario, the error variance decreases with \(X\):
set.seed(2020)
x = sample(1:100, 100, replace = TRUE)
error = rnorm(mean = x, sd = 100-x, n = 100)
y = rexp(x) + error
model = lm.fit_manual(x, y)
pred = model[1] + model[2]*x
resid = y - pred
df = data.frame(x, y, pred, resid)
plot1 = df %>% ggplot(aes(x=x, y=y)) +
geom_point(size = .8) +
geom_abline(intercept = model[1], slope = model[2], color = "firebrick") +
labs(x = "x", y = "y", title = "Scatter Plot",
subtitle = "where error variance decreases with x")
plot2 = df %>% ggplot(aes(x=x, y=resid)) +
geom_point(size = .8) +
geom_hline(yintercept = 0, color = "firebrick") +
labs(x = "x", y = "residual",
title = "Residual Plot",
subtitle = "where error variance decreases with x")
grid.arrange(plot1, plot2, ncol = 2)
In this scenario, the true regression function is \(\bigcup\) shaped but a linear regression function is fitted.
y = x^2
model = lm.fit_manual(x, y)
pred = model[1] + model[2]*x
resid = y - pred
df = data.frame(x, y, pred, resid)
plot1 = df %>% ggplot(aes(x=x, y=y)) +
geom_point(size = .8) +
geom_abline(intercept = model[1], slope = model[2], color = "firebrick") +
labs(x = "x", y = "y", title = "Scatter Plot",
subtitle = "where there is a misspecification")
plot2 = df %>% ggplot(aes(x=x, y=resid)) +
geom_point(size = .8) +
geom_hline(yintercept = 0, color = "firebrick") +
labs(x = "x", y = "residual",
title = "Residual Plot",
subtitle = "where there is a misspecification")
grid.arrange(plot1, plot2, ncol = 2)
Refer to Grade point average Problem 1.19.
Answer: The box plot is plotted below.
gpa = read.csv('CH01PR19.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
gpa %>%
ggplot(aes(x = '', y = x)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'ACT Scores', title = "Box Plot of ACT Scores") +
coord_flip()
The median ACT score is 25, with the first and third quartiles at about 21 and 28. There are no outliers.
Answer: The dot plot of the residuals is plotted below.
model = lm.fit_manual(gpa$x, gpa$y)
pred = model[1] + model[2]*gpa$x
resid = gpa$y - pred
gpa_resids = data.frame(x = gpa$x, y = gpa$y, pred, resid)
gpa_resids %>% ggplot(aes(x = resid)) +
geom_dotplot(binwidth = .1, fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Residuals",
title = "Dot Plot of the Residuals", subtitle = "of Regressing GPA on ACT Scores")
The residuals are skewed to the left and centered around \(0\). There are two large negative residuals.
Answer: The residuals are plotted against the fitted values below.
gpa_resids %>% ggplot(aes(x = pred, y = resid)) +
geom_point() +
labs(x = "Yhat", y = "e_i",
title = "Residual Plot", subtitle = "of Regressing GPA on ACT Scores")
The residuals appear randomly scattered. A linear relationship between ACT scores and GPA is plausible since there is no pattern in the residuals, and the error variance appears constant.
Answer: The normal probability plot is plotted below.
prob_plot_table = function(df){
model = lm.fit_manual(df$x, df$y)
pred = model[1] + model[2]*df$x
resid = df$y - pred
n = nrow(df)
resids_df = data.frame(x = df$x, y = df$y, pred, resid)
mse_sq_root = sqrt(sum((resids_df$y - resids_df$pred)^2) / (n - 2))
resids_df = resids_df %>% mutate("Run" = 1:nrow(resids_df),
"k" = rank(resid)) %>%
mutate("exp_val" = mse_sq_root * qnorm((k - .375)/(n + .25)))
return(resids_df)
}
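The Blom-type plotting positions used in prob_plot_table, \((k - .375)/(n + .25)\), can be checked by simulation (an aside; the choices of \(n\) and replicate count are arbitrary):

```r
# Monte Carlo check: qnorm((k - .375)/(n + .25)) approximates the
# expected order statistics of an N(0, 1) sample.
set.seed(7)
n = 20
sims = replicate(2000, sort(rnorm(n)))   # each column is a sorted sample
emp = rowMeans(sims)                     # simulated expected order statistics
blom = qnorm(((1:n) - .375)/(n + .25))   # Blom approximation
# max(abs(emp - blom)) stays small, so the approximation is close
```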
prob_plot = function(df, subtitle){
prob_plot_table(df) %>%
ggplot(aes(x = resid, y = exp_val)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
labs(x = "Residuals", y = "Expected Value of Residuals Under Normality",
title = "Normal Probability Plot",
subtitle = subtitle)
}
The normal probability plot of the residuals in estimating GPA using ACT scores is shown below.
prob_plot(gpa, "GPA Scores")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(gpa) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9737275
## exp_val 0.9737275 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=120\), is \(.9889\). Since the calculated correlation coefficient is less than the critical value, there is evidence that the error terms are not reasonably normally distributed. This was also evident in the plot above, where the points do not line up.
Answer:
brown.forsythe = function(df, split, alpha){
df = prob_plot_table(df)
n = nrow(df)
group1 = df[df$x < split,]
n1 = nrow(group1)
e1_median = median(group1$resid)
d1 = abs(group1$resid - e1_median)
d1_bar = mean(d1)
group2 = df[df$x >= split,]
n2 = nrow(group2)
e2_median = median(group2$resid)
d2 = abs(group2$resid - e2_median)
d2_bar = mean(d2)
s = sqrt((sum((d1 - d1_bar)^2) + sum((d2 - d2_bar)^2))/(n-2))
t_ast = abs(round((d1_bar - d2_bar) / (s * sqrt((1/n1) + (1/n2))), 3))
if(t_ast < qt(1 - (alpha/2), n-2)){
paste("At the alpha level of", alpha, "the test statistic is", t_ast, "and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant.", sep = ' ')
}
else{
paste("At the alpha level of", alpha, "the test statistic is", t_ast, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
}
}
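For reference, the same \(t^*\) statistic can be computed standalone with lm() on simulated constant-variance data (a sketch; the median split and seed are arbitrary choices):

```r
# Standalone Brown-Forsythe statistic on simulated homoscedastic data.
set.seed(3)
x = runif(40)
y = 1 + 2*x + rnorm(40, sd = 0.5)
e = resid(lm(y ~ x))
g1 = e[x < median(x)]; g2 = e[x >= median(x)]        # split residuals by x
d1 = abs(g1 - median(g1)); d2 = abs(g2 - median(g2)) # absolute deviations
s2 = (sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)) / (length(e) - 2)
t_ast = (mean(d1) - mean(d2)) / sqrt(s2 * (1/length(g1) + 1/length(g2)))
# compare abs(t_ast) with qt(1 - alpha/2, n - 2), as brown.forsythe does
```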
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(gpa, 26, 0.01)
## [1] "At the alpha level of 0.01 the test statistic is 0.897 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
Answer:
gpa_fulldata = read.csv("CH03PR03.txt", sep = '', header = FALSE,
col.names = c("y", "x1", "x2", "x3"),
colClasses = rep("numeric", 4))
gpa_fulldata_resids = merge(prob_plot_table(gpa), gpa_fulldata)
plot_1 = gpa_fulldata_resids %>% ggplot(aes(x = x2, y = resid)) + geom_point() +
labs(x = "Intelligence Test Score", y = "Residuals",
title = "Residuals against Intelligence Test Score")
plot_2 = gpa_fulldata_resids %>% ggplot(aes(x = x3, y = resid)) + geom_point() +
labs(x = "High School Class Rank Percentile", y = "Residuals",
title = "Residuals against HS Class Rank Percentile")
grid.arrange(plot_1, plot_2, nrow = 2)
The residuals increase as intelligence test scores increase and stay relatively constant with increasing high school class rank percentile.
Refer to Copier maintenance Problem 1.20.
Answer:
The dot plot for the number of copiers serviced is shown below.
copier = read.csv('CH01PR20.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
copier %>%
ggplot(aes(x = x)) + geom_dotplot(fill = "darkseagreen", color = "black") +
scale_x_continuous(breaks = seq(0, 10, by = 2)) +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Number of Copiers", y = '', title = "Dot Plot of the Number of Copiers Serviced")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the number of copiers serviced appears slightly right-skewed. There are no outlying cases with respect to this variable.
Answer:
The time plot for the number of copiers serviced is shown below.
copier %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x=time, y = x)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Number of Copiers",
title = "Number of Copiers Serviced", subtitle = "in time order")
As time increases, there does not appear to be any overall trend in the number of copiers being serviced. It is noted, however, that the number of copiers serviced rises and falls roughly every three to four time periods.
Answer:
The stem-and-leaf plot of the residuals is shown below.
model = lm.fit_manual(copier$x, copier$y)
pred = model[1] + model[2]*copier$x
resid = copier$y - pred
df = data.frame(copier$x, copier$y, pred, resid)
stem(df$resid)
##
## The decimal point is 1 digit(s) to the right of the |
##
## -2 | 30
## -1 |
## -1 | 3110
## -0 | 99997
## -0 | 44333222111
## 0 | 001123334
## 0 | 5666779
## 1 | 112234
## 1 | 5
The residuals hover around 0 and appear approximately normally distributed.
Answer:
The residual plots of \(e_i\) versus \(\hat{Y}_i\) and \(e_i\) versus \(X_i\) are shown below.
plot_1 = df %>% ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Number of Copiers Serviced and Service Time")
plot_2 = df %>% ggplot(aes(x = copier.x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "number of copiers", y = "residual",
title = "Number of Copiers Serviced vs Residual")
grid.arrange(plot_1, plot_2, nrow = 2)
These two plots provide the same information. The nonlinearity of the regression function can be studied from these plots. It is found that there is a reasonably random distribution of points, indicating that there is a linear relationship between the number of copiers serviced and service time.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(copier, "of the Residuals of Predicting Service Time by Number of Copiers Serviced")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(copier) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9892079
## exp_val 0.9892079 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .10\) and \(n=45\), is \(.981\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appeared to line up.
Answer:
The time plot of the residuals is shown below.
df %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = resid)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Residuals",
title = "Residuals in Predictions", subtitle = "in time order")
There does not appear to be any correlation in error terms over time.
Answer:
breusch.pagan = function(df, alpha){
model = lm.fit_manual(df$x, df$y)
pred = model[1] + model[2]*df$x
resid = df$y - pred
df_new = data.frame(x = df$x, y = df$y, pred = pred, resid = resid, resid_sq = resid^2)
sse = sum(resid^2)
model_sq_error = lm.fit_manual(df_new$x, df_new$resid_sq)
pred_sq_error = model_sq_error[1] + model_sq_error[2]*df_new$x
ssr_ast = sum((pred_sq_error - mean(df_new$resid_sq))^2)
chi_sq_BP = round((ssr_ast/2) / ((sse/nrow(df))^2), 3)
if(chi_sq_BP < qchisq(1 - alpha, 1)){
paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant.", sep = ' ')
}
else{
paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
}
}
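The same statistic can be reproduced with base R's lm() (a standalone sketch on simulated data, not one of the problem data sets):

```r
# Breusch-Pagan statistic as used above: (SSR*/2) / (SSE/n)^2, where
# SSR* comes from regressing the squared residuals on x.
set.seed(9)
x = runif(50, 1, 10)
y = 2 + x + rnorm(50)
fit = lm(y ~ x)
e2 = resid(fit)^2
sse = sum(e2)
aux = lm(e2 ~ x)                          # auxiliary regression
ssr_ast = sum((fitted(aux) - mean(e2))^2)
chi_sq_BP = (ssr_ast / 2) / (sse / length(y))^2
# compare chi_sq_BP with qchisq(1 - alpha, 1), as breusch.pagan does
```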
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(copier, alpha = .05)
## [1] "At the alpha level of 0.05 the test statistic is 1.315 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
Answer:
The residuals against \(X_2\) and \(X_3\) are shown below.
copier_fulldata = read.csv('CH03PR04.txt', sep = '', header = FALSE,
col.names = c('y', 'x1', 'x2', 'x3'),
colClasses = rep('numeric', 4))
copier_fulldata_resids = merge(prob_plot_table(copier), copier_fulldata)
plot_1 = copier_fulldata_resids %>% ggplot(aes(x = x2, y = resid)) + geom_point() +
labs(x = "Mean Operational Age of Copiers, in Months", y = "Residuals",
title = "Residuals against Mean Operational Age of Copiers")
plot_2 = copier_fulldata_resids %>% ggplot(aes(x = x3, y = resid)) + geom_point() +
labs(x = "Years of Experience of the Service Person", y = "Residuals",
title = "Residuals against Years of Experience of the Service Person")
grid.arrange(plot_1, plot_2, nrow = 2)
Residuals increase as the mean operational age of copiers, in months, increases, indicating that the model does not predict well for older copiers. There does not seem to be any pattern in model residuals and years of experience of the service person making the call.
Refer to Airfreight breakage Problem 1.21.
Answer:
The dot plot for the number of transfers is shown below.
airfreight = read.csv('CH01PR21.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
airfreight %>%
ggplot(aes(x = x)) + geom_dotplot(fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Number of Transfers", y = '', title = "Dot Plot of the Number of Transfers")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the number of transfers appears asymmetrical: as the number of transfers increases, the number of such cases decreases.
Answer:
The time plot of the number of transfers is shown below.
airfreight %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = x)) +
geom_point() +
geom_line(color = "deepskyblue2") +
scale_x_continuous(breaks = 1:10) +
labs(x = "time", y = "Number of Transfers",
title = "Number of Transfers", subtitle = "in time order")
There does not appear to be any systematic trend in the number of transfers over time.
Answer:
The stem-and-leaf plot of the residuals is shown below.
model = lm.fit_manual(airfreight$x, airfreight$y)
pred = model[1] + model[2]*airfreight$x
resid = airfreight$y - pred
df = data.frame(x = airfreight$x, y = airfreight$y, pred, resid)
stem(df$resid)
##
## The decimal point is at the |
##
## -2 | 2
## -1 | 222
## -0 | 2
## 0 | 888
## 1 | 88
The residuals cluster around two values, roughly \(-1.2\) and \(0.8\), producing two pronounced peaks.
Answer:
The plots of the residuals \(e_i\) against \(X_i\) is shown below.
df %>% ggplot(aes(x = x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "number of transfers", y = "residual",
title = "Number of Transfers vs Residual")
As the number of transfers increases, the residuals become smaller and smaller. This is a sign of nonconstant variance of the error terms.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(df, "of the Residuals of Predicting Breakage using Number of Transfers")
The coefficent of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(airfreight) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9913394
## exp_val 0.9913394 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .01\) and \(n=10\), is \(.879\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appeared to line up.
Answer:
The time plot of the residuals is shown below.
df %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = resid)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Residuals",
title = "Residuals in Predictions", subtitle = "in time order")
There is no pattern in the residuals as time passes, meaning there is no correlation in the error terms over time.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(airfreight, alpha = .1)
## [1] "At the alpha level of 0.1 the test statistic is 1.033 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
This conclusion is supported by the preliminary findings in part (d).
Refer to Plastic hardness Problem 1.22.
Answer:
The boxplot of the residuals is shown below.
plastic = read.csv('CH01PR22.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
plastic_model = prob_plot_table(plastic)
plastic_model %>%
ggplot(aes(x = '', y = resid)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'Residual',
title = "Box Plot of the Residuals of Predicting Hardness of Plastic using Time Elapsed") +
coord_flip()
The residuals appear to be symmetrically distributed, with a median slightly above 0.
Answer:
The residuals \(e_i\) are plotted against the fitted values \(\hat{Y}_i\) below.
plastic_model %>%
ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Time Elapsed and Hardness of Plastic")
The residuals appear roughly random but trend slightly downward against the fitted values. This could mean the variance of the error term is nonconstant.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(plastic, "of the Residuals between Time Elapsed and Hardness of Plastic")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(plastic) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9916733
## exp_val 0.9916733 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=16\), is \(.941\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appeared to line up.
Answer:
expected_freq = function(df){
model = lm.fit_manual(df$x, df$y)
pred = model[1] + model[2]*df$x
resid = df$y - pred
n = nrow(df)
resids_df = data.frame(x = df$x, y = df$y, pred, resid)
mse_sq_root = sqrt(sum((resids_df$y - resids_df$pred)^2) / (n - 2))
resids_df = resids_df %>% mutate("Run" = 1:nrow(resids_df),
"k" = rank(resid)) %>%
mutate("exp_val_25" = mse_sq_root * qt((k - .375)/(n + .25), .25),
"exp_val_50" = mse_sq_root * qt((k - .375)/(n + .25), .5),
"exp_val_75" = mse_sq_root * qt((k - .375)/(n + .25), .75))
return(resids_df)
}
expected_freq_table = expected_freq(plastic)
plot_1 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_25)) +
geom_point() +
geom_text(aes(x = 0, y = -20000, label = "25%")) +
labs(x = '',
y = "Expected Value of Residuals Under Normality",
title = "Normal Probability Plot using Various Expected Frequencies under Normality")
plot_2 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_50)) +
geom_point() +
geom_text(aes(x = 0, y = -200, label = "50%")) +
labs(x = "Residuals", y = '', title = '')
plot_3 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_75)) +
geom_point() +
geom_text(aes(x = 0, y = -45, label = "75%")) +
labs(x = '', y = '', title = '')
grid.arrange(plot_1, plot_2, plot_3, nrow = 1)
The information provided by these comparisons is consistent with the findings from the normal probability plot in part (c). As the percentile of the relevant \(t\) distribution increases, the points begin to align along the \(y=x\) line and the scale of the expected residual values becomes smaller.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(plastic, 25, 0.05)
## [1] "At the alpha level of 0.05 the test statistic is 0.856 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
Refer to Muscle mass Problem 1.27.
Answer:
A stem-and-leaf plot of the ages is shown below.
muscle = read.csv('CH01PR27.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
stem(muscle$x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 11122334
## 4 | 5677788
## 5 | 123344
## 5 | 56777999
## 6 | 00013334
## 6 | 5556889
## 7 | 001223
## 7 | 5666788888
The ages appear roughly uniformly distributed across the groups, with no skewness at either end and no strong indication of normality.
Answer:
The dot plot of the residuals is shown below.
muscle_model = prob_plot_table(muscle)
muscle_model %>%
ggplot(aes(x = resid)) +
geom_dotplot(binwidth = 2, fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Residuals",
title = "Dot Plot of the Residuals", subtitle = "of Regressing Muscle Mass on Age")
The residuals are skewed to the right due to the presence of an outlier.
Answer:
The plot of the residual \(e_i\) against \(\hat{Y}_i\) and also against \(X_i\) is shown below.
plot_1 = muscle_model %>% ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Age on Muscle Mass ")
plot_2 = muscle_model %>% ggplot(aes(x = x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "age", y = "residual",
title = "Age vs Residual")
grid.arrange(plot_1, plot_2, nrow = 2)
These two plots show the same information, mirrored; this is because of the negative relationship between muscle mass and age. Since the residuals appear scattered with no particular pattern, the error terms plausibly have constant variance, although the residuals are slightly larger at one end than the other.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(muscle, "of the Residuals between Muscle Mass and Age")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(muscle) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9897499
## exp_val 0.9897499 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .10\) and \(n=60\), is \(.984\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appeared to line up.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(muscle, alpha = .01)
## [1] "At the alpha level of 0.01 the test statistic is 3.817 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
Refer to Crime rate Problem 1.28.
Answer:
The stem-and-leaf plot of the percentage of individuals in the county having at least a high school diploma is shown below.
crime = read.csv('CH01PR28.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
stem(crime$x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 6 | 1444
## 6 | 5678
## 7 | 00334444
## 7 | 5555666677777778888888999999
## 8 | 000011111112222222233333344444
## 8 | 55578889
## 9 | 11
There are far more counties with percentages in the 70s and 80s than at either end. The percentages appear approximately normally distributed.
Answer:
The box plot of the residuals is shown below.
crime_model = prob_plot_table(crime)
crime_model %>%
ggplot(aes(x = '', y = resid)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'Residual',
title = "Box Plot of the Residuals of Predicting Crime Rate",
subtitle = "using % of People with High School Diplomas") +
coord_flip()
The distribution of the residuals does not appear to be symmetrical.
Answer:
The residual plot of \(e_i\) against \(\hat{Y}_i\) is shown below.
crime_model %>%
ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Crime Rate")
The residuals appear more varied in the middle range of predictions, whereas at either end they congregate toward \(0\).
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(crime, "of the Residuals in Predicting Crime Rate using % of HSD")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(crime) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9887589
## exp_val 0.9887589 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=84\), is \(.985\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appeared to line up.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(crime, 69, 0.05)
## [1] "At the alpha level of 0.05 the test statistic is 0.355 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
Electricity consumption. An economist studying the relation between household electricity (\(Y\)) and number of rooms in the home (\(X\)) employed linear regression model (2.1) and obtained the residuals. Plot the residuals against \(X_i\). What problem appears to be present here? Might a transformation alleviate the problem?
Answer:
A plot of the residuals against \(X_i\) is shown below.
electricity = read.csv('CH03PR09.txt', sep = '', header = FALSE,
col.names = c('x', 'e'),
colClasses = c('numeric', 'numeric'))
electricity %>%
ggplot(aes(x = x, y = e)) +
geom_point(color = "cadetblue") +
labs(x = "number of rooms", y = "residual",
title = "Number of Rooms vs Residual")
The spread of the residuals rises and falls with no detectable pattern, suggesting the linear regression function may be misspecified. A transformation might alleviate this problem by helping to correct the nonlinearity of the regression function.
Per capita earnings. A sociologist employed linear regression model (2.1) to relate per capita earnings (\(Y\)) to average number of years of schooling (\(X\)) for \(12\) cities. The fitted values \(\hat{Y}_i\) and the semistudentized residuals \(e_i^*\) are given.
Answer:
A plot of the semistudentized residuals against the fitted values is shown below.
earnings = read.csv('CH03PR10.txt', sep = '', header = FALSE,
col.names = c('y', 'e'),
colClasses = c('numeric', 'numeric'))
earnings %>%
ggplot(aes(x = y, y = e)) +
geom_point() +
labs(x = "Predicted Per Capita Earning", y = "Semistudentized Residual",
title = "Predicted Per Capita Earning vs Semistudentized Residual")
There is no pattern between the predicted per capita earnings and the semistudentized residuals. However, there appears to be one outlier with a residual much lower than the rest.
Answer:
The above plot is redrawn with two lines added indicating the area above and below \(1\) standard deviation.
earnings %>%
ggplot(aes(x = y, y = e)) +
geom_point() +
geom_hline(yintercept = -1,
linetype = "dashed", color = "magenta4") +
geom_hline(yintercept = 1,
linetype = "dashed", color = "magenta4") +
labs(x = "Predicted Per Capita Earning", y = "Semistudentized Residual",
title = "Predicted Per Capita Earning vs Semistudentized Residual")
There appear to be \(4\) semistudentized residuals outside \(\pm 1\) standard deviation. If the normal error model is appropriate, about \(68\%\) of the values should fall within \(1\) standard deviation of the mean. With \(n = 12\), that is roughly \(8.2\) values inside and \(3.8\), or about \(4\), outside, so the observed count is consistent with the normal error model.
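The arithmetic behind the expected count can be sketched directly (using the exact normal probability rather than the rounded \(68\%\)):

```r
# Expected number of semistudentized residuals outside +/- 1 standard
# deviation under the normal error model, for n = 12.
n = 12
p_inside = pnorm(1) - pnorm(-1)  # about 0.6827
inside = n * p_inside            # about 8.19 values expected inside
outside = n - inside             # about 3.81, i.e. roughly 4 outside
```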
Drug concentration. A pharmacologist employed linear regression model (2.1) to study the relation between the concentration of a drug in plasma (\(Y\)) and the log-dose of the drug (\(X\)). The residuals and log-dose levels are given.
Answer:
A plot of the residuals against the \(X_i\) values is shown below.
drug = read.csv('CH03PR11.txt', sep = '', header = FALSE,
col.names = c('x', 'e'),
colClasses = c('numeric', 'numeric'))
drug %>%
ggplot(aes(x = x, y = e)) +
geom_point() +
labs(x = "Log Dose of the Drug", y = "Residual",
title = "Log Dose of the Drug vs Residual")
As the log dose of the drug increases, the residual grows in magnitude. This shows that the error variance is not constant.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan.error = function(df, alpha){
df_new = data.frame(x = df$x, resid = df$e, resid_sq = (df$e)^2)
sse = sum(df$e^2)
model_sq_error = lm.fit_manual(df_new$x, df_new$resid_sq)
pred_sq_error = model_sq_error[1] + model_sq_error[2]*df_new$x
ssr_ast = sum((pred_sq_error - mean(df_new$resid_sq))^2)
chi_sq_BP = round((ssr_ast/2) / ((sse/nrow(df))^2), 3)
if(chi_sq_BP < qchisq(1 - alpha, 1)){
paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant.", sep = ' ')
}
else{
paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
}
}
breusch.pagan.error(drug, .05)
## [1] "At the alpha level of 0.05 the test statistic is 3.718 and the null hypothesis is not rejected. There is insufficient evidence that the error variance is nonconstant."
A student does not understand why the sum of squares defined in (3.16) is called a pure error sum of squares “since the formula looks like one for an ordinary sum of squares.” Explain.
Answer:
The equation given in (3.16) is \[ \text{SSE}(F) = \sum_j \sum_i (Y_{ij} - \overline{Y}_{j})^2 = \text{SSPE} \] Although the formula looks like an ordinary sum of squares, each deviation \(Y_{ij} - \overline{Y}_j\) is taken around the mean of the replicates at that \(X\) level, so it measures only the inherent (“pure”) variability of \(Y\) and contains no lack-of-fit component. The formula is derived from the full model for the lack of fit test, \[ Y_{ij} = \mu_j + \varepsilon_{ij} \] in which there are no restrictions on the means \(\mu_j\) (\(\text{E}[Y_{ij}] = \mu_j\)). In the regression model, by contrast, the mean responses are restricted to be linear in \(X\): \(\text{E}[Y] = \beta_0 + \beta_1X\).
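SSPE is easy to compute directly from replicated data. A minimal sketch on hypothetical values (not from the textbook), taking each \(X\) level's deviations around that level's own mean:

```r
# Hypothetical replicated data: X levels 1, 2, 3 with replicates
x = c(1, 1, 2, 2, 2, 3)
y = c(2.1, 2.3, 3.9, 4.2, 4.0, 6.1)
# SSPE: within each X level, sum the squared deviations around the level mean
sspe = sum(tapply(y, x, function(v) sum((v - mean(v))^2)))
sspe  # levels with a single observation contribute 0
```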
Refer to Copier maintenance Problem 1.20.
Answer: When testing for lack of fit of a linear regression function, the alternative conclusions are \(H_0: \text{E}[Y] = \beta_0 + \beta_1X\) (the regression function is linear) and \(H_A: \text{E}[Y] \neq \beta_0 + \beta_1X\) (the regression function is not linear).
Answer:
lack.of.fit.test = function(data, alpha = 0.05){
X = data$x
Y = data$y
Y_bars = data.frame(x = sort(unique(X)))
mean_vec = c()
for(i in Y_bars$x){
temp_mean = mean(data[data$x == i, "y"])
mean_vec = c(mean_vec, temp_mean)
}
Y_bars$ybars = mean_vec
SSPE= 0
for(i in Y_bars$x){
temp_Y = data[data$x == i, "y"]
SSPE = SSPE + sum((temp_Y - Y_bars[Y_bars$x == i, "ybars"])^2)
}
df_F = nrow(data) - length(unique(X))
model = lm.fit_manual(X, Y)
Ypreds = model[1] + model[2]*X
SSE_R = sum((Ypreds - Y)^2)
df_R = nrow(data) - 2
F_ast = round(((SSE_R - SSPE) / (df_R - df_F))/(SSPE / df_F), 3)
if(F_ast < qf(1 - alpha, length(unique(X)) - 2, nrow(data) - length(unique(X)))){
paste("At the alpha level of", alpha, "the test statistic is", F_ast, "and we fail to reject the null hypothesis. There is no sufficient evidence that the regression function is not linear.", sep = ' ')
}
else{
paste("At the alpha level of", alpha, "the test statistic is", F_ast, "and the null hypothesis is rejected. There is evidence that the regression function is not linear.", sep = ' ')
}
}
lack.of.fit.test(copier, 0.05)
## [1] "At the alpha level of 0.05 the test statistic is 0.968 and we fail to reject the null hypothesis. There is no sufficient evidence that the regression function is not linear."
Answer: The lack of fit test decomposes the error sum of squares as \[ \text{SSE} = \text{SSPE} + \text{SSLF} \] where SSPE is the pure error deviation and SSLF is the lack of fit deviation. The test compares SSLF against SSPE, so it is sensitive only to departures of the mean response from linearity. Nonconstant error variance or nonnormality of the error terms is not detected by this decomposition; indeed, the lack of fit \(F\) test itself assumes that the error terms are independent and normal with constant variance.
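The decomposition can be verified numerically. A sketch on hypothetical replicated data (not from the textbook), computing SSLF from the level means and the fitted values:

```r
# Numeric check of SSE = SSPE + SSLF on hypothetical replicated data
x = c(1, 1, 2, 2, 3, 3)
y = c(1.2, 1.0, 2.3, 2.1, 2.8, 3.3)
fit  = lm(y ~ x)
sse  = sum(resid(fit)^2)                                   # total error SS
sspe = sum(tapply(y, x, function(v) sum((v - mean(v))^2))) # pure error SS
ybar = tapply(y, x, mean)                                  # level means
yhat = predict(fit, newdata = data.frame(x = as.numeric(names(ybar))))
sslf = sum(table(x) * (ybar - yhat)^2)                     # lack of fit SS
all.equal(sse, sspe + sslf)                                # TRUE
```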
Refer to Plastic hardness Problem 1.22.
Answer: Let the null hypothesis be that the regression function relating the hardness of the plastic to the elapsed time is linear, and the alternative hypothesis that it is not linear. Then
lack.of.fit.test(plastic, 0.01)
## [1] "At the alpha level of 0.01 the test statistic is 0.824 and we fail to reject the null hypothesis. There is no sufficient evidence that the regression function is not linear."
Answer: There is no particular advantage to having an equal number of replications at each of the \(X\) levels.
Answer: When the lack of fit test concludes that the regression function is not linear, it does not indicate which family of regression functions is appropriate. Studying the residuals can help identify a suitable family of functions.
Solution concentration. A chemist studied the concentration of a solution (\(Y\)) over time (\(X\)). Fifteen identical solutions were prepared. The \(15\) solutions were randomly divided into five sets of three, and the five sets were measured, respectively, after \(1\), \(3\), \(5\), \(7\) and \(9\) hours.
Answer:
solutions = read.csv('CH03PR15.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
model = round(lm.fit_manual(solutions$x, solutions$y), 3)
paste("The coefficients of the linear regression function are b0 = ", model[1],
" and b1 = ", model[2], ".", sep = '')
## [1] "The coefficients of the linear regression function are b0 = 2.575 and b1 = -0.324."
Answer: The null hypothesis is that the regression function relating the concentration of the solution to time is linear. The alternative hypothesis is that the regression function is not linear. Then
lack.of.fit.test(solutions, 0.025)
## [1] "At the alpha level of 0.025 the test statistic is 58.603 and the null hypothesis is rejected. There is evidence that the regression function is not linear."
Answer: The lack of fit test does not indicate which regression function is appropriate when it leads to the conclusion that lack of fit of a linear regression function exists. It only establishes that the linear regression function is not adequate.
Refer to Solution concentration Problem 3.15.
Answer:
solutions %>%
ggplot(aes(x,y)) +
geom_point() +
labs(x = "time", y = "concentration",
title = "Concentration of Solution at Various Times")
To achieve constant variance and linearity, a \(\log_{10}\) transformation on \(Y\) might be useful.
Answer:
box.cox = function(x, y, lambda){
sse = c()
for(i in lambda){
if(i != 0){
y_prime = y^i
}
else{
y_prime = log(y)
}
k2 = (prod(y))^(1/length(y))
if(i != 0){
k1 = 1/(i*k2^(i-1))
W = k1*(y_prime - 1)
}
else{
W = k2*y_prime
}
model = lm.fit_manual(x, W)
preds = model[1] + model[2]*x
sse_temp = sum((W - preds)^2)
sse = c(sse, sse_temp)
}
df = data.frame(lambda = lambda, sse = sse)
lambda_min_sse = df[df$sse == min(df$sse), "lambda"]
if(lambda_min_sse == 0){
print("Using the Box-Cox procedure and standardization, the most appropriate power transformation on Y is a natural log transformation.")
}
else{
paste("Using the Box-Cox procedure and standardization, the most appropriate power transformation on Y is Y to the power of ", lambda_min_sse, '.', sep = '')
}
}
box.cox(solutions$x, solutions$y, seq(-.2, .2, length = 5))
## [1] "Using the Box-Cox procedure and standardization, the most appropriate power transformation on Y is a natural log transformation."
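As a cross-check, the profile-likelihood version of the procedure is available in the MASS package (assuming it is installed). A sketch on hypothetical log-linear data, where a \(\lambda\) near \(0\) should be selected:

```r
# Cross-check with MASS::boxcox (profile log-likelihood over lambda);
# sketch on hypothetical data generated from a log-linear model
library(MASS)
set.seed(3)
x = rep(1:5, each = 3)
y = exp(2 - 0.3 * x + rnorm(15, sd = 0.1))
bc = boxcox(lm(y ~ x), lambda = seq(-1, 1, 0.1), plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the profile likelihood, near 0
```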
Answer:
Y_prime = log10(solutions$y)
model = round(lm.fit_manual(solutions$x, Y_prime), 3)
paste("b0:", model[1], ", b1:", model[2], sep = ' ')
## [1] "b0: 0.655 , b1: -0.195"
Answer:
solutions_transformed = data.frame(x = solutions$x,
y = log10(solutions$y),
y_original = solutions$y,
y_pred = model[1] + model[2]*solutions$x,
resids = log10(solutions$y) - (model[1] + model[2]*solutions$x))
solutions_transformed %>%
ggplot(aes(x, y)) +
geom_point() +
geom_abline(intercept = model[1], slope = model[2]) +
labs(x = "time",
y = "log10(concentration)",
title = "Regressing on Log10(concentration)")
The regression line appears to be a good fit to the transformed data.
Answer:
plot_1 = solutions_transformed %>%
ggplot(aes(x = y_pred, y = resids)) +
geom_point() +
labs(x = "fitted value",
y = "residual",
title = "Fitted Values vs Residual Plot")
plot_2 = prob_plot(solutions_transformed, "of Log10(concentration) vs Time")
grid.arrange(plot_1, plot_2, nrow = 1)
The transformation makes the error terms approximately normally distributed, though their variance still appears nonconstant.
Answer:
paste("Y = concentration = 10^(", model[1], " + ", model[2], "* x)", sep = '')
## [1] "Y = concentration = 10^(0.655 + -0.195* x)"
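Predictions on the original concentration scale follow by undoing the \(\log_{10}\) transformation. A small sketch using the rounded coefficients \(b_0 = 0.655\) and \(b_1 = -0.195\) fitted above:

```r
# Back-transform fitted values from the log10 scale to concentration
b0 = 0.655
b1 = -0.195
conc_hat = function(x) 10^(b0 + b1 * x)
conc_hat(c(1, 5, 9))  # fitted concentrations at 1, 5 and 9 hours
```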
Sales growth. A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data is given as follows, where \(X\) is the year (coded) and \(Y\) is sales in thousands of units.
Answer:
sales = read.csv('CH03PR17.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
sales %>%
ggplot(aes(x,y)) +
geom_point() +
labs(x = "year",
y = "sales (in thousands)",
title = "Sales of Product vs Year")
There appears to be a linear relationship between the two variables.
Answer:
box.cox(sales$x, sales$y, seq(.3, .7, by = .1))
## [1] "Using the Box-Cox procedure and standardization, the most appropriate power transformation on Y is Y to the power of 0.5."
Answer:
Y_prime = sqrt(sales$y)
model = round(lm.fit_manual(sales$x, Y_prime), 3)
paste("b0:", model[1], ", b1:", model[2], sep = ' ')
## [1] "b0: 10.261 , b1: 1.076"
Answer:
sales_transformed = data.frame(x = sales$x,
y = sqrt(sales$y),
y_original = sales$y,
y_pred = model[1] + model[2]*sales$x,
resids = sqrt(sales$y) - (model[1] + model[2]*sales$x))
sales_transformed %>%
ggplot(aes(x, y)) +
geom_point() +
geom_abline(intercept = model[1], slope = model[2]) +
labs(x = "year",
y = "sqrt(sales)",
title = "Regressing on Sqrt(Sales)")
The regression line appears to be a good fit to the transformed data.
Answer:
plot_1 = sales_transformed %>%
ggplot(aes(x = y_pred, y = resids)) +
geom_point() +
labs(x = "fitted value",
y = "residual",
title = "Fitted Values vs Residual Plot")
plot_2 = prob_plot(sales_transformed, "of Sqrt(Sales) vs Year")
grid.arrange(plot_1, plot_2, nrow = 1)
The residuals appear normally distributed, with somewhat nonconstant variance.
Answer:
paste("Y = (", model[1], " + ", model[2], "* x)^2", sep = '')
## [1] "Y = (10.261 + 1.076* x)^2"
Production time. In a manufacturing study, the production times for \(111\) recent production runs were obtained. The table lists for each run the production time in hours (\(Y\)) and the production lot size (\(X\)).
Answer:
prod_time = read.csv('CH03PR18.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
prod_time %>%
ggplot(aes(x,y)) +
geom_point() +
labs(x = "lot size",
y = "production time (in hrs)",
title = "Production Time (in hrs) vs Lot Size")
There does not appear to be a linear relation between the two variables. A transformation on \(X\) is appropriate here, since there are replicates of \(X\) in this data set.
Answer:
model_sqrt = lm.fit_manual(sqrt(prod_time$x), prod_time$y)
prod_time_transformed = data.frame(x_original = prod_time$x,
x = sqrt(prod_time$x),
y = prod_time$y,
y_pred = model_sqrt[1] + model_sqrt[2]*sqrt(prod_time$x),
resids = prod_time$y - (model_sqrt[1] + model_sqrt[2]*sqrt(prod_time$x)))
model = round(lm.fit_manual(prod_time_transformed$x, prod_time_transformed$y), 3)
paste("b0:", model[1], ", b1:", model[2], sep = ' ')
## [1] "b0: 1.255 , b1: 3.624"
Answer:
prod_time_transformed %>%
ggplot(aes(x, y)) +
geom_point() +
geom_abline(intercept = model[1], slope = model[2]) +
labs(x = "sqrt(lot size)",
y = "production time (in hrs)",
title = "Regressing using Sqrt(Lot Size)")
The regression line seems to be a reasonable fit for the transformed data.
Answer:
plot_1 = prod_time_transformed %>%
ggplot(aes(x = y_pred, y = resids)) +
geom_point() +
labs(x = "fitted value",
y = "residual",
title = "Fitted Values vs Residual Plot")
plot_2 = prob_plot(prod_time_transformed, "of Production Time vs Sqrt(Lot Size)")
grid.arrange(plot_1, plot_2, nrow = 1)
The residuals have nonconstant variance and are distributed approximately normally.
Answer:
paste("Y = ", model[1], " + ", model[2], "sqrt(X)", sep = '')
## [1] "Y = 1.255 + 3.624sqrt(X)"
A student fitted a linear regression function for a class assignment. The student plotted the residuals \(e_i\) against \(Y_i\) and found a positive relation. When the residuals were plotted against the fitted values \(\hat{Y}_i\), the student found no relation. How could this difference arise? Which is the more meaningful plot?
Answer: Plotting the residuals against the observed values \(Y_i\) always shows a positive relation: since \(Y_i = \hat{Y}_i + e_i\) and the residuals are uncorrelated with the fitted values, \(\text{Cov}(Y_i, e_i) = \text{Var}(e_i) > 0\). Plotting the residuals against the fitted values \(\hat{Y}_i\) shows no relation precisely because of that zero correlation. The plot of \(e_i\) against \(\hat{Y}_i\) is therefore the more meaningful one, as it reflects properties of the fitted model rather than an artifact present in every regression.
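This can be illustrated with a quick simulation (hypothetical data, not from the textbook):

```r
# Residuals correlate positively with Y but not with the fitted values
set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)
fit = lm(y ~ x)
cor(resid(fit), y)            # positive
cor(resid(fit), fitted(fit))  # essentially zero: OLS residuals are
                              # orthogonal to the fitted values
```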
If the error terms in a regression model are independent \(N(0, \sigma^2)\), what can be said about the error terms after transformation \(X' = \frac{1}{X}\) is used? Is the situation the same after transformation \(Y' = \frac{1}{Y}\) is used?
Answer: If the error terms in a regression model are independent \(N(0, \sigma^2)\), the transformation \(X' = \frac{1}{X}\) leaves them unchanged: transforming \(X\) alters only the shape of the regression function, so the error terms remain independent \(N(0, \sigma^2)\). The situation is not the same for \(Y' = \frac{1}{Y}\). Transforming \(Y\) also transforms the error terms, and since the reciprocal of a normally distributed random variable is not normally distributed, the errors in the model for \(Y'\) will no longer be normal.
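A quick simulation illustrates the asymmetry between the two transformations (hypothetical numbers):

```r
# The reciprocal of a normal variable is not normally distributed
set.seed(7)
y = 5 + rnorm(5000)        # approximately N(5, 1), bounded away from zero
shapiro.test(1/y)$p.value  # tiny: normality of 1/y is strongly rejected
```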
Derive the result in (3.29).
Answer: \[ \underbrace{\sum \sum (Y_{ij} - \hat{Y}_{ij})^2}_{\text{SSE}} = \underbrace{\sum \sum (Y_{ij} - \overline{Y}_j)^2}_{\text{SSPE}} + \underbrace{\sum \sum (\overline{Y}_j - \hat{Y}_{ij})^2}_{\text{SSLF}} \]
The sum of squares decomposes into two parts: the full-model (pure error) sum of squares and the lack of fit sum of squares. Start from the left-hand side and note that \(\hat{Y}_{ij} = b_0 + b_1X_j\) depends only on \(j\); it is constant across the replicates \(i\) within an \(X\) level, so we may write it as \(\hat{Y}_j\).
\[ \begin{aligned} \sum \sum (Y_{ij} - \hat{Y}_{ij})^2 &= \sum \sum [(Y_{ij} - \overline{Y}_j) + (\overline{Y}_j - \hat{Y}_{ij})]^2 \\ &= \sum \sum (Y_{ij} - \overline{Y}_j)^2 + \sum \sum (\overline{Y}_j - \hat{Y}_{ij})^2 + 2\sum_j (\overline{Y}_j - \hat{Y}_j) \sum_i (Y_{ij} - \overline{Y}_j) \\ &= \sum \sum (Y_{ij} - \overline{Y}_j)^2 + \sum \sum (\overline{Y}_j - \hat{Y}_{ij})^2 + 2\sum_j (\overline{Y}_j - \hat{Y}_j)(n_j\overline{Y}_j - n_j\overline{Y}_j) \\ &= \sum \sum (Y_{ij} - \overline{Y}_j)^2 + \sum \sum (\overline{Y}_j - \hat{Y}_{ij})^2 \end{aligned} \] where \(\hat{Y}_j = b_0 + b_1X_j\) denotes the fitted value, which is constant within level \(j\). The cross term vanishes because \(\sum_i (Y_{ij} - \overline{Y}_j) = 0\) at each \(X\) level.
Using (A.70), (A.41) and (A.42), show that \(\text{E[MSPE]} = \sigma^2\) for normal error regression model (2.1).
Answer: Note that, from (A.70), for the sample variance \(s^2\), \(\frac{(n-1)s^2}{\sigma^2}\) is distributed as \(\chi^2\) with \(n-1\) degrees of freedom when the random sample is from a normal population.
MSPE, or pure error mean square, is defined as \(\frac{\text{SSPE}}{n-c}\), or the pure error sum of squares divided by the sum of the component degrees of freedom. Now, SSPE is made up of the sums of squared deviations at each \(X\) level, or \(\sum_i (Y_{ij} - \overline{Y}_j)^2\). Therefore, using the above property, \[ \begin{aligned} \text{E[MSPE]} &= \text{E}\left[\frac{\sum_j \sum_i (Y_{ij} - \overline{Y}_j)^2}{n-c}\right] \\ &= \frac{1}{n-c} \sum_j \text{E}\left[(n_j - 1)s_j^2\right] \\ &= \frac{1}{n-c} \sum_j \text{E}[\sigma^2 \chi^2 (n_j - 1)] \\ &= \frac{\sigma^2}{n-c} \sum_j (n_j - 1) \\ &= \sigma^2 \end{aligned} \]
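A simulation check of the unbiasedness (hypothetical design, \(c = 5\) levels with three replicates each):

```r
# Monte Carlo check that E[MSPE] = sigma^2 under the normal error model
set.seed(42)
sigma = 2
x = rep(1:5, each = 3)    # n = 15, c = 5
mspe = replicate(2000, {
y = 1 + 0.5 * x + rnorm(length(x), sd = sigma)
sspe = sum(tapply(y, x, function(v) sum((v - mean(v))^2)))
sspe / (length(x) - 5)    # divide SSPE by n - c
})
mean(mspe)                # close to sigma^2 = 4
```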
A linear regression model with intercept \(\beta_0=0\) is under consideration. Data have been obtained that contain replications. State the full and reduced models for testing the appropriateness of the regression function under consideration. What are the degrees of freedom associated with the full and reduced models if \(n=20\) and \(c=10\)?
Answer: The full model is \[Y_{ij} = \mu_j + \varepsilon_{ij}\] and the reduced model is \[ Y_{ij} = \beta_1X_j + \varepsilon_{ij} \] When \(n=20\) and \(c=10\), the degrees of freedom are as follows: \[ \begin{aligned} df_F &= n - c = 20 - 10 = 10 \\ df_R &= n - 1 = 20 - 1 = 19 \end{aligned} \]
Blood pressure. Data was obtained in a study of the relation between diastolic blood pressure (\(Y\)) and age (\(X\)) for boys 5 to 13 years old.
Answer:
blood_pressure = read.csv('CH03PR24.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
model = round(lm.fit_manual(blood_pressure$x, blood_pressure$y), 3)
paste("b0:", model[1], "b1:", model[2], sep = ' ')
## [1] "b0: 48.667 b1: 2.333"
Y_pred = model[1] + model[2]*blood_pressure$x
resid = blood_pressure$y - Y_pred
data.frame(resid, x = blood_pressure$x) %>%
ggplot(aes(x, resid)) +
geom_point() +
labs(x = "age",
y = "residual",
title = "Residuals of Predicting Diastolic Blood Pressure using Age")
The residual plot shows an outlier at age 12, which makes the error variance appear nonconstant.
Answer:
blood_pressure_new = blood_pressure[c(1:6, 8:nrow(blood_pressure)),]
model = round(lm.fit_manual(blood_pressure_new$x, blood_pressure_new$y), 3)
paste("b0:", model[1], "b1:", model[2], sep = ' ')
## [1] "b0: 53.068 b1: 1.621"
By omitting case 7, the intercept of the regression function increases and the slope becomes less steep. Case 7 was pulling the fitted line upward at the higher ages.
Answer:
Yhat_pred_interval = function(data, Xhat, alpha = 0.01){
X = data[['x']]
Y = data[['y']]
df = nrow(data) - 2
b1 = sum((X - mean(X))*(Y - mean(Y))) / (sum((X - mean(X))^2))
b0 = mean(Y) - b1*mean(X)
Ypred = b0 + b1*X
Yhat = b0 + b1*Xhat
mse = sum((Y - Ypred)^2) / df
var_adjust = (1/nrow(data)) + ((Xhat - mean(X))^2)/sum((X - mean(X))^2) + 1
t = qt(1 - (alpha/2), df = df)
se = sqrt(mse * var_adjust)
error = t*se
lower = round(Yhat - error, 5)
upper = round(Yhat + error, 5)
return(c(lower, upper))
}
Yhat_pred_interval(blood_pressure_new, Xhat = 12, alpha = 0.01)
## [1] 60.31266 84.73588
blood_pressure[7,]
## y x
## 7 90 12
Observation \(Y_7\) falls outside this prediction interval, which suggests that case 7 is an outlier.
Refer to the CDI data set in Appendix C.2 and Project 1.43. For each of the three fitted regression models, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Summarize your conclusions. Is linear regression model (2.1) more appropriate in one case than in the others?
Answer:
cdi_cols = c('ID', 'county', 'state', 'area', 'total_pop', 'perc_pop_18to34',
'perc_pop_65plus', 'num_physicians', 'num_hosp_beds', 'total_crimes',
'perc_hs_grads', 'perc_bach', 'perc_below_poverty', 'perc_unemploy',
'per_capita_income', 'total_personal_income', 'geographic_region')
cdi_colclasses = c('integer', 'character', 'character', rep('numeric', 13), 'factor')
cdi = read.csv('APPENC02.txt', sep = '', header = FALSE,
col.names = cdi_cols,
colClasses = cdi_colclasses)
cdi_43 = cdi %>%
select(num_physicians, total_pop,
num_hosp_beds, total_personal_income)
plot_1 = cdi_43 %>%
gather(type, x, total_pop, num_hosp_beds, total_personal_income) %>%
group_by(type) %>%
do(., augment(lm(num_physicians ~ x, data = .))) %>%
ggplot(aes(x = x, y = .resid, color = type)) +
geom_point() +
facet_wrap(type ~ ., ncol = 2, scales = "free") +
theme(legend.position = '') +
labs(x = '',
y = "residual",
title = "Residuals vs X")
plot_2 = prob_plot(select(cdi_43, x = num_hosp_beds, y = num_physicians), "Number of Hospital Beds")
plot_3 = prob_plot(select(cdi_43, x = total_personal_income, y = num_physicians), "Total Personal Income")
plot_4 = prob_plot(select(cdi_43, x = total_pop, y = num_physicians), "Total Population")
plot_1
grid.arrange(plot_2, plot_3, plot_4, ncol = 3)
The error terms from all three regression models appear to have constant variance and to be approximately normally distributed, though with heavy tails at the upper end. The linear regression model is most appropriate when using number of hospital beds, since its diagnostics depart least from the model assumptions.
Refer to the CDI data set in Appendix C.2 and Project 1.44. For each geographic region, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Do the four regions appear to have similar error variances? What other conclusions do you draw from your plots?
Answer:
cdi %>% select(per_capita_income, perc_bach, geographic_region) %>%
group_by(geographic_region) %>%
do(., augment(lm(per_capita_income ~ perc_bach, data = .))) %>%
ggplot(aes(x = perc_bach, y = .resid, color = factor(geographic_region))) +
geom_point() +
facet_wrap(geographic_region ~ ., ncol = 2, scales = "free") +
theme(legend.position = '') +
labs(x = '',
y = "residual",
title = "Residuals vs X")
plot_1 = prob_plot(select(filter(cdi, geographic_region == 1), x = perc_bach, y = per_capita_income), "Geo Loc 1")
plot_2 = prob_plot(select(filter(cdi, geographic_region == 2), x = perc_bach, y = per_capita_income), "Geo Loc 2")
plot_3 = prob_plot(select(filter(cdi, geographic_region == 3), x = perc_bach, y = per_capita_income), "Geo Loc 3")
plot_4 = prob_plot(select(filter(cdi, geographic_region == 4), x = perc_bach, y = per_capita_income), "Geo Loc 4")
grid.arrange(plot_1, plot_2, plot_3, plot_4, ncol = 2)
Most of the residual plots show higher variance at large values of \(X\), so the four regions do not appear to have similar error variances. The residuals do, however, appear normally distributed in all four regression models.
Refer to the SENIC data set in Appendix C.1 and Project 1.45.
Answer:
senic_cols = c('ID', 'length_stay', 'age', 'infection_risk', 'culturing_ratio',
'chest_xray_ratio', 'num_beds', 'med_school_aff', 'region',
'avg_daily_census', 'num_nurses', 'available_facilities')
senic_colclasses = c(rep('numeric', 6), rep('factor', 2), rep('numeric', 3))
senic = read.csv('APPENC01.txt', sep = '', header = FALSE,
col.names = senic_cols,
colClasses = senic_colclasses)
senic_model = senic %>%
select(length_stay, infection_risk, available_facilities, chest_xray_ratio)
plot_1 = senic_model %>%
gather(type, x, infection_risk, available_facilities, chest_xray_ratio) %>%
group_by(type) %>%
do(., augment(lm(length_stay ~ x, data = .))) %>%
ggplot(aes(x = x, y = .resid, color = type)) +
geom_point() +
facet_wrap(type ~ ., ncol = 2, scales = "free") +
theme(legend.position = '') +
labs(x = '',
y = "residual",
title = "Residuals vs X")
plot_2 = prob_plot(select(senic_model, x = chest_xray_ratio, y = length_stay), "Chest Xray Ratio")
plot_3 = prob_plot(select(senic_model, x = available_facilities, y = length_stay), "Available Facilities")
plot_4 = prob_plot(select(senic_model, x = infection_risk, y = length_stay), "Infection Risk")
plot_1
grid.arrange(plot_2, plot_3, plot_4, ncol = 3)
In all three regression models the error terms show constant variance and are approximately normally distributed, with slightly heavy tails at the upper end. The linear regression model is appropriate in all three cases.
Answer:
senic_model_2 = senic %>%
filter(!ID %in% c(47, 112)) %>%
select(y = length_stay, x = infection_risk)
Yhat_pred_interval(senic_model_2, 6.5, 0.05)
## [1] 8.31863 13.30654
senic[47, c("ID", "length_stay", "infection_risk")]
## ID length_stay infection_risk
## 47 47 19.56 6.5
Yhat_pred_interval(senic_model_2, 5.8, 0.05)
## [1] 7.90776 12.86377
senic[112, c("ID", "length_stay", "infection_risk")]
## ID length_stay infection_risk
## 112 112 17.94 5.9
Observations \(Y_{47}\) and \(Y_{112}\) both fall outside these prediction intervals, which suggests that both cases are outliers.
Refer to the SENIC data set in Appendix C.1 and Project 1.46. For each geographic region, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Do the four regions appear to have similar error variance? What other conclusions do you draw from your plots?
Answer:
senic %>%
select(length_stay, infection_risk, region) %>%
group_by(region) %>%
do(., augment(lm(length_stay ~ infection_risk, data = .))) %>%
ggplot(aes(x = infection_risk, y = .resid, color = factor(region))) +
geom_point() +
facet_wrap(region ~ ., ncol = 2) +
labs(x = "Infection Risk",
y = "Residual",
title = "Residuals vs Infection Risk",
subtitle = "based on Region") +
theme(legend.position = '')
plot_1 = prob_plot(select(filter(senic, region == 1), x = infection_risk, y = length_stay), "Loc 1")
plot_2 = prob_plot(select(filter(senic, region == 2), x = infection_risk, y = length_stay), "Loc 2")
plot_3 = prob_plot(select(filter(senic, region == 3), x = infection_risk, y = length_stay), "Loc 3")
plot_4 = prob_plot(select(filter(senic, region == 4), x = infection_risk, y = length_stay), "Loc 4")
grid.arrange(plot_1, plot_2, plot_3, plot_4, ncol = 2)
The error variances appear constant and similar for the regression models of regions 2, 3 and 4; the error variance may be higher for region 1. The error terms appear normally distributed for regions 2, 3 and 4, while the region 1 error terms show heavy tails at both ends.
Refer to Copier maintenance Problem 1.20.
Answer:
plot_1 = copier %>%
mutate(band = case_when(
x <= 2.5 ~ 1,
x > 2.5 & x <= 4.5 ~ 2,
x > 4.5 & x <= 6.5 ~ 3,
x > 6.5 & x <= 8.5 ~ 4,
x > 8.5 ~ 5)) %>%
group_by(band) %>%
summarize(x_med = median(x, na.rm = TRUE),
y_med = median(y, na.rm = TRUE)) %>%
ggplot(aes(x_med, y_med)) +
geom_point(shape=23, fill="blue", size=4) +
geom_line(color = "darkgreen") +
geom_point(data = copier, aes(x,y)) +
labs(x = "number of copiers serviced",
y = "total number of minutes",
title = "Regressing Total Number of Minutes on Number of Copiers Serviced",
subtitle = "using Regression Bands of Medians",
caption = "Note: The blue diamond represents medians of each band.")
## `summarise()` ungrouping output (override with `.groups` argument)
plot_1
The band smooth suggests that the regression relation is not linear, since the medians of the bands do not fall along a straight line.
Answer:
conf_band_reg = function(data, alpha = 0.05){
model = lm.fit_manual(data$x, data$y)
Y_hat = model[1] + model[2]*data$x
mse = sum((Y_hat - data$y)^2)/(nrow(data)-2)
X_bar = mean(data$x)
X_iX_bar_diff_sum_sq = sum((data$x - X_bar)^2)
X_h = sort(unique(data$x))
Y_hat_sd = c()
for(i in X_h){
temp_sd = sqrt(mse * ((1/nrow(data)) + ((i - X_bar)^2/X_iX_bar_diff_sum_sq)))
Y_hat_sd = c(Y_hat_sd, temp_sd)
}
W_squared = 2*qf(1 - alpha, 2, nrow(data)-2)
W = sqrt(W_squared)
conf_band_df = data.frame(x = X_h,
y = model[1] + model[2]*X_h,
W = rep(W, length(X_h)),
s_Yhat = Y_hat_sd)
conf_band_df$lower_bound = conf_band_df$y - (conf_band_df$W*conf_band_df$s_Yhat)
conf_band_df$upper_bound = conf_band_df$y + (conf_band_df$W*conf_band_df$s_Yhat)
return(conf_band_df)
}
plot_1 +
geom_path(data = conf_band_reg(copier, alpha = 0.1),
aes(x = x, y = lower_bound), color = "firebrick") +
geom_path(data = conf_band_reg(copier, alpha = 0.1),
aes(x = x, y = upper_bound), color = "firebrick") +
labs(caption = "Note: The red lines indicate the upper and lower bounds of the regression line")
The band smooth does not fall entirely inside the confidence band. Hence the linear regression function may not be a good fit for these data.
Answer:
X_c_list = c()
pred_list = c()
for(i in seq(0.5, 6.5, by = 1)){
filtered_data = filter(copier, x >= i & x <= i + 3)
model = lm.fit_manual(filtered_data$x, filtered_data$y)
X_c = (i + i + 3)/2
pred = model[1] + model[2]*X_c
X_c_list = c(X_c_list, X_c)
pred_list = c(pred_list, pred)
}
data.frame(X_c = X_c_list, pred = pred_list) %>%
ggplot(aes(X_c, pred)) +
geom_point(color = "firebrick") +
geom_line(color = "firebrick") +
geom_point(data = copier, aes(x,y)) +
labs(x = "number of copiers serviced",
y = "total number of minutes",
title = "Regressing Total Number of Minutes on Number of Copiers Serviced",
subtitle = "using Lowess Smooth Method with Neighborhoods of Width 3")
The simplified lowess smooth does not span the full range of \(X\): it only covers the centers of the discrete neighborhoods, so the regions near the minimum and maximum values of \(X\) are left uncovered.
Refer to Sales growth Problem 3.17.
Answer:
sales %>%
mutate(band = case_when(
x > -.5 & x <= 1.5 ~ 1,
x > 1.5 & x <= 3.5 ~ 2,
x > 3.5 & x <= 5.5 ~ 3,
x > 5.5 & x <= 7.5 ~ 4,
x > 7.5 & x <= 9.5 ~ 5)) %>%
group_by(band) %>%
summarize(x_med = median(x), y_med = median(y)) %>%
ggplot(aes(x_med, y_med)) +
geom_point(color = "firebrick", size = 3) +
geom_path() +
geom_point(data = sales, aes(x,y)) +
labs(x = "year", y = "sales (in thousands)",
title = "Regressing on Sales using Year",
subtitle = "using Regression Bands of Medians")
## `summarise()` ungrouping output (override with `.groups` argument)
The band smooth suggests that the regression relation is not linear, since the medians do not line up on a straight line.
Answer:
X_c_list = c()
pred_list = c()
for(i in seq(-0.5, 5.5, by = 1)){
filtered_data = filter(sales, x >= i & x <= i + 3)
model = lm.fit_manual(filtered_data$x, filtered_data$y)
X_c = (i + i + 3)/2
pred = model[1] + model[2]*X_c
X_c_list = c(X_c_list, X_c)
pred_list = c(pred_list, pred)
}
plot_1 = data.frame(X_c = X_c_list, pred = pred_list) %>%
ggplot(aes(X_c, pred)) +
geom_point(color = "firebrick") +
geom_line(color = "firebrick") +
geom_point(data = sales, aes(x,y)) +
labs(x = "year",
y = "sales (in thousands)",
title = "Regressing Sales on Year",
subtitle = "using Lowess Smooth Method with Neighborhoods of Width 3")
Answer:
plot_1 +
geom_path(data = conf_band_reg(sales, alpha = 0.05),
aes(x = x, y = lower_bound), color = "blue") +
geom_path(data = conf_band_reg(sales, alpha = 0.05),
aes(x = x, y = upper_bound), color = "blue") +
labs(caption = "Note: The blue lines indicate the upper and lower bounds of the regression line")
The simplified lowess smooth appears to fall entirely within the confidence band for the regression line. This suggests that the linear regression function is appropriate.
Refer to the Real estate sales data set in Appendix C.7. Obtain a random sample of \(200\) cases from the \(522\) cases in this data set. Using the random sample, build a regression model to predict sales price (\(Y\)) as a function of finished square feet (\(X\)). The analysis should include an assessment of the degree to which the key regression assumptions are satisfied. If the regression assumptions are not met, include and justify appropriate remedial measures. Use the final model to predict sales price for two houses that are about to come on the market: the first has \(X=1100\) finished square feet and the second has \(X=4900\) finished square feet. Assess the strengths and weaknesses of the final model.
Answer:
real_estate_cols = c('ID', 'sales_price', 'sq_ft', 'num_bed',
'num_bath', 'air_cond', 'garage_size', 'pool', 'year_built',
'quality', 'style', 'lot_size', 'adj_to_highway')
real_estate_colclasses = c(rep('numeric', 5), 'factor',
'numeric', 'factor', 'numeric',
rep('factor', 2), 'numeric', 'factor')
real_estate = read.csv('APPENC07.txt', sep = '', header = FALSE,
col.names = real_estate_cols,
colClasses = real_estate_colclasses)
set.seed(2020)
real_estate_sampled = real_estate[sample(1:nrow(real_estate), 200, replace = FALSE), c("sales_price", "sq_ft")]
model = lm.fit_manual(real_estate_sampled$sq_ft, real_estate_sampled$sales_price)
real_estate_sampled %>%
ggplot(aes(sq_ft, sales_price)) +
geom_point() +
labs(x = "square feet",
y = "sales price",
title = "Scatterplot of Sales Price and Square Feet",
subtitle = "based on random sample of 200 points")
Using the randomly selected points, a model was created. Regression diagnostic plots are shown below.
real_estate_model = data.frame(x = real_estate_sampled$sq_ft,
y = real_estate_sampled$sales_price,
y_pred = model[1] + model[2]*real_estate_sampled$sq_ft,
resids = real_estate_sampled$sales_price - (model[1] + model[2]*real_estate_sampled$sq_ft))
plot_1 = real_estate_model %>%
ggplot(aes(x, resids)) +
geom_point() +
labs(x = "square feet",
y = "residual",
title = "Residuals from Regressing Square Feet on Sales Price")
plot_2 = prob_plot(data.frame(x=real_estate_sampled$sq_ft, y=real_estate_sampled$sales_price), "of Regressing Square Feet on Sales Price")
grid.arrange(plot_1, plot_2, ncol = 1)
The error terms appear approximately normally distributed, though the spread of the residuals increases at larger square footage.
The predicted sales price for a house with \(X=1100\) finished square feet is
model[1] + model[2]*1100
## [1] 99535.26
and the predicted sales price for a house with \(X=4900\) finished square feet is
model[1] + model[2]*4900
## [1] 687669.1
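Point predictions alone understate the uncertainty. With base R's lm(), 95% prediction intervals for the two houses could be obtained as below; this is a sketch on simulated stand-in data with the same column names, whereas the actual analysis would pass real_estate_sampled to lm():

```r
# Sketch: 95% prediction intervals for the two new houses.
set.seed(1)
demo = data.frame(sq_ft = runif(200, 800, 5000))
demo$sales_price = -70000 + 155*demo$sq_ft + rnorm(200, sd = 40000)
fit = lm(sales_price ~ sq_ft, data = demo)
pi_tab = predict(fit, newdata = data.frame(sq_ft = c(1100, 4900)),
interval = "prediction", level = 0.95)
pi_tab  # columns: fit (point prediction), lwr, upr
```

Note that the interval for the \(4900\)-square-foot house is an extrapolation toward the edge of the observed range, where the interval is widest.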
The final model is likely to be unreliable for houses with large square footage, as demonstrated by the widening spread of the residuals toward the high end of square feet in the diagnostic plot above. It also ignores the many other factors that influence a house's price, such as lot size, quality, and year built.
Refer to the Prostate cancer data set in Appendix C.5. Build a regression model to predict PSA level (\(Y\)) as a function of cancer volume (\(X\)). The analysis should include an assessment of the degree to which the key regression assumptions are satisfied. If the regression assumptions are not met, include and justify appropriate remedial measures. Use the final model to estimate mean PSA level for a patient whose cancer volume is \(20\) cc. Assess the strengths and weaknesses of the final model.
Answer:
prostate_cols = c('ID', 'PSA_level', 'cancer_volume', 'weight', 'age',
'benign_prostatic_hyperplasia', 'seminal_vesical_invation',
'capsular_penetration', 'gleason_score')
prostate_colclasses = c(rep('numeric', 6), 'factor', 'numeric', 'factor')
prostate = read.csv('APPENC05.txt', sep = '', header = FALSE,
col.names = prostate_cols,
colClasses = prostate_colclasses)
prostate %>%
ggplot(aes(cancer_volume, PSA_level)) +
geom_point() +
labs(x = "cancer volume",
y = "PSA level",
title = "Scatterplot of PSA Level and Cancer Volume")
Using all of the cases, a model was fitted. Regression diagnostic plots are shown below.
model = lm.fit_manual(prostate$cancer_volume, prostate$PSA_level)
prostate_model = data.frame(x = prostate$cancer_volume,
y = prostate$PSA_level,
y_pred = model[1] + model[2]*prostate$cancer_volume,
resids = prostate$PSA_level - (model[1] + model[2]*prostate$cancer_volume))
plot_1 = prostate_model %>%
ggplot(aes(x, resids)) +
geom_point() +
labs(x = "cancer volume",
y = "residual",
title = "Residuals from Regressing PSA Level on Cancer Volume")
plot_2 = prob_plot(data.frame(x=prostate$cancer_volume, y=prostate$PSA_level), "of Regressing PSA Level on Cancer Volume")
grid.arrange(plot_1, plot_2, ncol = 1)
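One principled way to choose a remedial transformation of the response is a Box-Cox analysis. Below is a minimal sketch using the MASS package (shipped with R), shown on simulated data whose truth is log10-linear so that it runs standalone:

```r
# Sketch: Box-Cox analysis; an estimated lambda near 0 supports
# a log transformation of Y.
library(MASS)
set.seed(1)
x = runif(97, 0.2, 45)                        # stand-in for cancer volume
y = 10^(0.8 + 0.04*x + rnorm(97, sd = 0.4))   # log10-linear truth
bc = boxcox(lm(y ~ x), plotit = FALSE)
lambda = bc$x[which.max(bc$y)]
lambda
```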
The model does not appear to fit the data well: the error terms deviate heavily from normality, and the residual spread grows with cancer volume, where the data are also sparse. To remedy this, a \(\log_{10}\) transformation is applied to the PSA level (\(Y\)) variable. Refitting the regression model on the transformed response yields the following diagnostic plots.
model = lm.fit_manual(prostate$cancer_volume, log10(prostate$PSA_level))
prostate_model = data.frame(x = prostate$cancer_volume,
y = log10(prostate$PSA_level),
y_pred = model[1] + model[2]*prostate$cancer_volume,
resids = log10(prostate$PSA_level) - (model[1] + model[2]*prostate$cancer_volume))
plot_1 = prostate_model %>%
ggplot(aes(x, resids)) +
geom_point() +
labs(x = "cancer volume",
y = "residual",
title = "Residuals from Regressing Log10(PSA Level) on Cancer Volume")
plot_2 = prob_plot(data.frame(x=prostate$cancer_volume, y=log10(prostate$PSA_level)), "of Regressing Log10(PSA Level) on Cancer Volume")
grid.arrange(plot_1, plot_2, ncol = 1)
After the \(\log_{10}\) transformation of \(Y\), the error terms appear approximately normally distributed with constant variance. In the original units, the fitted regression function is:
paste("Y = 10^(", round(model[1], 3), "+", round(model[2], 3), "x)", sep = '')
## [1] "Y = 10^(0.784+0.042x)"
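The problem also asks for the estimated PSA level at a cancer volume of \(20\) cc. Back-transforming the fitted log10 model gives an estimate of the median (not the mean) response in the original units; the sketch below uses the rounded coefficients printed above, so the unrounded model[1] and model[2] would give a slightly more precise value:

```r
# Estimated (median) PSA level at a cancer volume of 20 cc,
# using the rounded coefficients for illustration.
est = 10^(0.784 + 0.042*20)
est  # roughly 42
```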